Edit distance and comparison scores
Abstract
The simplest form of sequence comparison is based on editing strings and counting the number of edits required to get from one sequence to the other. The formulation of string-edit distance d_e balances two different types of edits. The simplest is replacement of a single letter by another letter. To start with, we need a metric on the alphabet Σ from which our sequences are drawn. Let D_Σ be a metric on Σ, so that D_Σ(ξ, η) measures the difference between the two letters ξ and η in Σ. We now show how to extend this to a metric d_e on the set Σ* of all finite strings over Σ. If two strings x and y differ only in the k-th position, then we set d_e(x, y) = D_Σ(x_k, y_k). In general, when there are multiple replacements, string-edit distance simply sums their effects.

String-edit distance also allows a second kind of change: insertion and deletion. For example, let x_k̂ denote the string x with the k-th entry removed. If x_k̂ agrees perfectly with the string y, we assign d_e(x, y) = δ, where δ is the deletion penalty. Insertions of characters are likewise allowed in determining edit distance: if y = x_k̂, then inserting x_k into y at the k-th position yields x. Again, the effect of multiple insertions and deletions is additive, and this allows strings of different lengths to be compared.

Because both replacements and insertions/deletions are allowed, an edit path from x to y is not unique. Edit distance is therefore defined by taking the minimum over all possible representations, as we define formally in (1.4). This will not in general define a metric unless appropriate conditions on δ and D_Σ are satisfied. These conditions can be stated by extending the alphabet Σ and the metric D_Σ to include a "gap" as a character (let Σ̃ denote the extended alphabet and D̃_Σ the extended metric), and by assigning a distance D̃_Σ(x, gap) to each character x in the original alphabet.
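The minimum over edit paths described above can be computed with the standard dynamic-programming recurrence. The following is a minimal sketch rather than the paper's own formulation: the formal definition (1.4) is not reproduced here, and the function names `edit_distance` and `discrete`, the single uniform gap penalty `delta`, and the choice of the discrete metric for D_Σ are all assumptions made for illustration.

```python
from typing import Callable


def edit_distance(x: str, y: str,
                  letter_dist: Callable[[str, str], float],
                  delta: float) -> float:
    """Minimum total cost of turning x into y.

    letter_dist plays the role of the letter metric D_Sigma (assumed),
    delta is a uniform insertion/deletion penalty (assumed).
    """
    m, n = len(x), len(y)
    # d[i][j] = minimum cost of editing the prefix x[:i] into the prefix y[:j]
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * delta          # delete the first i letters of x
    for j in range(1, n + 1):
        d[0][j] = j * delta          # insert the first j letters of y
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + letter_dist(x[i - 1], y[j - 1]),  # replace (cost 0 if equal)
                d[i - 1][j] + delta,                                # delete x[i-1]
                d[i][j - 1] + delta,                                # insert y[j-1]
            )
    return d[m][n]


def discrete(a: str, b: str) -> float:
    """Discrete metric on letters: 0 if equal, 1 otherwise (illustrative choice)."""
    return 0.0 if a == b else 1.0
```

With the discrete metric and `delta = 1.0` this reduces to the familiar Levenshtein distance. Choosing a small penalty such as `delta = 0.4` makes a deletion followed by an insertion cheaper than a single replacement, which is one way to see why δ and D_Σ have to be constrained together.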
Theorem 9.4 of [1] says that d_e is a metric on strings of letters in Σ whenever D̃_Σ is a metric on the extended alphabet Σ̃.
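For a small finite alphabet, the hypothesis of the theorem can be checked by brute force. The sketch below is only an illustration under stated assumptions: the gap symbol `"-"`, the helper names `extend` and `is_metric`, and the use of one uniform gap distance `delta` for every letter are hypothetical choices, not taken from the source. It tests the hypothesis (that the extended letter distance is a metric) and says nothing on its own about cases where that hypothesis fails.

```python
from itertools import product
from typing import Callable, Iterable

GAP = "-"  # hypothetical gap symbol; the source does not fix a notation here


def extend(letter_dist: Callable[[str, str], float],
           delta: float) -> Callable[[str, str], float]:
    """Extend a letter distance to the alphabet with a gap, using a uniform gap cost."""
    def ext(a: str, b: str) -> float:
        if a == b:
            return 0.0
        if a == GAP or b == GAP:
            return delta
        return letter_dist(a, b)
    return ext


def is_metric(points: Iterable[str],
              dist: Callable[[str, str], float],
              tol: float = 1e-12) -> bool:
    """Brute-force check of the metric axioms on a finite set of points."""
    pts = list(points)
    for a, b in product(pts, repeat=2):
        if dist(a, b) < 0 or abs(dist(a, b) - dist(b, a)) > tol:
            return False                      # negativity or asymmetry
        if (dist(a, b) <= tol) != (a == b):
            return False                      # zero distance exactly when equal
    for a, b, c in product(pts, repeat=3):
        if dist(a, c) > dist(a, b) + dist(b, c) + tol:
            return False                      # triangle inequality violated
    return True


def discrete(a: str, b: str) -> float:
    """Discrete metric on letters: 0 if equal, 1 otherwise (illustrative choice)."""
    return 0.0 if a == b else 1.0
```

For example, over the letters `"ACGT"` with the discrete metric, `is_metric(list("ACGT") + [GAP], extend(discrete, 1.0))` holds, while a gap cost of `0.4` fails the triangle inequality because two gap steps cost less than one replacement.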
Publication date: 2007